The html version of this notebook is in the link [https://drive.google.com/drive/folders/1-jkQBXFr9dOV5RLfjMmz7rl6BwK0B4PP?usp=sharing].
Coronavirus disease (COVID-19) spreads from person to person through close contact and sometimes by airborne transmission (CDC, 2020b). Older adults and people with medical conditions are more likely to be infected and to become severely ill from COVID-19 (CDC, 2020a). It has been a year since WHO declared the COVID-19 outbreak a Public Health Emergency on 30th Jan. 2020 (WHO, 2020). As of today, 3rd Feb. 2021, COVID-19 has taken the lives of 2,237,636 people worldwide, and there have been over 100 million confirmed cases (WHO, 2021). The average time from hospitalization to a severe condition is about 7 days (Wang et al., 2020).
This report examines the relationship between each country's total number of COVID-19 Deaths, its total number of Confirmed last week cases, and its WHO Region. It uses the country-wise COVID-19 dataset as of 27th Jul. 2020 published by 'imdevskp' on GitHub. Confirmed last week is the cumulative number of confirmed cases up to 20th Jul. 2020, and WHO Region reflects the location and population mobility that can affect the spread of COVID-19 and, eventually, the death tolls of countries in the region. The relationship between Deaths and Confirmed last week can help to understand the mortality risk of COVID-19 in each country and to respond with appropriate measures. If a country has a higher number of Confirmed last week cases, more patients are expected to develop severe symptoms in the following week, and the country is expected to report more Deaths. If a country is in a WHO Region with higher average confirmed cases, it is likewise expected to report more Deaths.
This project aims not only to raise public awareness of the severity and spread of COVID-19, but also to uncover the relationship between each country's deaths, its confirmed cases and other factors. The cumulative number of Deaths in each country is the dependent variable; the cumulative number of Confirmed last week cases and the WHO Region of each country are the main independent variables, and other factors will be introduced for further analysis. Besides Deaths, Confirmed last week and WHO Region, the dataset also contains cumulative counts (Confirmed, Deaths, Recovered), daily updates (Active cases, New cases, New recovered, New deaths) and rates (Deaths / 100 Cases, Recovered / 100 Cases, Deaths / 100 Recovered, 1 week % increase) for each country. Moreover, the project intends to introduce other country-level characteristics, such as total population (total_pop), to better explain the relationship.
It is helpful to import pandas, numpy and datetime packages for data analysis. The dataframe covid contains 187 entries in total.
covid Dataframe:
covid.describe() shows the summary statistics of the dataframe. No column contains NaN values, so no further cleaning is needed for missing data. Notably, the column Deaths / 100 Recovered contains some infinite values, which make its mean inf and its standard deviation NaN. Since Deaths / 100 Recovered is the cumulative number of deaths divided by the cumulative number of recovered, a country that has not yet reported any recovered patients yields an infinite ratio. Thus, the inf values are replaced with NaN so that they do not affect the mean and standard deviation of Deaths / 100 Recovered.
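As a minimal sketch of why the infinite values appear and what the replacement does, the toy example below uses made-up counts for three hypothetical countries (they are not taken from the dataset). It assumes only pandas and numpy.

```python
import numpy as np
import pandas as pd

# Made-up counts for three hypothetical countries; the last has no recoveries yet
deaths = pd.Series([10, 5, 2])
recovered = pd.Series([100, 50, 0])

ratio = 100 * deaths / recovered      # 200 / 0 gives inf in pandas, not an error
cleaned = ratio.replace([np.inf, -np.inf], np.nan)

print(ratio.mean())     # inf: a single infinite value dominates the mean
print(cleaned.mean())   # 10.0: NaN is skipped by default
print(cleaned.count())  # 2: count() reports only the non-missing values
```

The same mechanism explains why the count of Deaths / 100 Recovered in the summary table is lower than the count of the other columns once the replacement has been applied.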
#! pip install pandas_datareader
import matplotlib.colors as mplc
import matplotlib.patches as patches
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import statsmodels.formula.api as sm #for linear regression: sm.ols
from pandas_datareader import DataReader
%matplotlib inline
import datetime as date
covid=pd.DataFrame(pd.read_csv(
"/users/lucygu/downloads/country_wise_latest.csv"))
covid.info()
covid=covid.replace([np.inf, -np.inf], np.nan)
covid.describe()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 187 entries, 0 to 186
Data columns (total 15 columns):
 #   Column                  Non-Null Count  Dtype
---  ------                  --------------  -----
 0   Country/Region          187 non-null    object
 1   Confirmed               187 non-null    int64
 2   Deaths                  187 non-null    int64
 3   Recovered               187 non-null    int64
 4   Active                  187 non-null    int64
 5   New cases               187 non-null    int64
 6   New deaths              187 non-null    int64
 7   New recovered           187 non-null    int64
 8   Deaths / 100 Cases      187 non-null    float64
 9   Recovered / 100 Cases   187 non-null    float64
 10  Deaths / 100 Recovered  187 non-null    float64
 11  Confirmed last week     187 non-null    int64
 12  1 week change           187 non-null    int64
 13  1 week % increase       187 non-null    float64
 14  WHO Region              187 non-null    object
dtypes: float64(4), int64(9), object(2)
memory usage: 22.0+ KB
| | Confirmed | Deaths | Recovered | Active | New cases | New deaths | New recovered | Deaths / 100 Cases | Recovered / 100 Cases | Deaths / 100 Recovered | Confirmed last week | 1 week change | 1 week % increase |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 1.870000e+02 | 187.000000 | 1.870000e+02 | 1.870000e+02 | 187.000000 | 187.000000 | 187.000000 | 187.000000 | 187.000000 | 182.000000 | 1.870000e+02 | 187.000000 | 187.000000 |
| mean | 8.813094e+04 | 3497.518717 | 5.063148e+04 | 3.400194e+04 | 1222.957219 | 28.957219 | 933.812834 | 3.019519 | 64.820535 | 40.558297 | 7.868248e+04 | 9448.459893 | 13.606203 |
| std | 3.833187e+05 | 14100.002482 | 1.901882e+05 | 2.133262e+05 | 5710.374790 | 120.037173 | 4197.719635 | 3.454302 | 26.287694 | 336.669357 | 3.382737e+05 | 47491.127684 | 24.509838 |
| min | 1.000000e+01 | 0.000000 | 0.000000e+00 | 0.000000e+00 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000e+01 | -47.000000 | -3.840000 |
| 25% | 1.114000e+03 | 18.500000 | 6.265000e+02 | 1.415000e+02 | 4.000000 | 0.000000 | 0.000000 | 0.945000 | 48.770000 | 1.442500 | 1.051500e+03 | 49.000000 | 2.775000 |
| 50% | 5.059000e+03 | 108.000000 | 2.815000e+03 | 1.600000e+03 | 49.000000 | 1.000000 | 22.000000 | 2.150000 | 71.320000 | 3.580000 | 5.020000e+03 | 432.000000 | 6.890000 |
| 75% | 4.046050e+04 | 734.000000 | 2.260600e+04 | 9.149000e+03 | 419.500000 | 6.000000 | 221.000000 | 3.875000 | 86.885000 | 6.232500 | 3.708050e+04 | 3172.000000 | 16.855000 |
| max | 4.290259e+06 | 148011.000000 | 1.846641e+06 | 2.816444e+06 | 56336.000000 | 1076.000000 | 33728.000000 | 28.560000 | 100.000000 | 3259.260000 | 3.834677e+06 | 455582.000000 | 226.320000 |
The summary table indicates that confirmed, recovered and active cases vary widely from country to country, as the standard deviations are large. The median of total confirmed cases is about 5,000, and 75% of countries keep their total confirmed cases below about 40,000, while the average total per country as of July 27th is about 88,000. This shows that a few countries have a very large number of confirmed cases, which pulls the mean far above the median. The same gap between mean and median also appears in active and recovered cases. Deaths have a smaller standard deviation than confirmed cases; however, the mean still differs greatly from the median and even exceeds the 75th percentile. I investigate the distributions further in the analysis.
This section focuses on the relationship between the independent variable Confirmed last week and the dependent variable Deaths. Using .describe(), we can see the summary statistics of each of the two numerical variables, and we plot a histogram of each.
covid['Deaths'].describe()
count       187.000000
mean       3497.518717
std       14100.002482
min           0.000000
25%          18.500000
50%         108.000000
75%         734.000000
max      148011.000000
Name: Deaths, dtype: float64
deaths=covid['Deaths'].copy()
#fine tune the graph
fig, ax = plt.subplots(figsize=(10,10))
deaths.plot(
kind="hist", y="Deaths", color='#1b48fc',
bins=100, legend=False, density=False, ax=ax
)#density=True--percentage, density=False--count
ax.set_facecolor((0.96, 0.96, 0.96))
fig.set_facecolor((0.96, 0.96, 0.96))
ax.spines['right'].set_visible(False)
ax.spines['top'].set_visible(False)
ax.set_ylabel('Count')
ax.set_xlabel('Deaths cases')
ax.set_title("The total COVID-19 death cases by countries until July 27th 2020")
covid.loc[covid['Deaths']>=20000]
| | Country/Region | Confirmed | Deaths | Recovered | Active | New cases | New deaths | New recovered | Deaths / 100 Cases | Recovered / 100 Cases | Deaths / 100 Recovered | Confirmed last week | 1 week change | 1 week % increase | WHO Region |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 23 | Brazil | 2442375 | 87618 | 1846641 | 508116 | 23284 | 614 | 33728 | 3.59 | 75.61 | 4.74 | 2118646 | 323729 | 15.28 | Americas |
| 61 | France | 220352 | 30212 | 81212 | 108928 | 2551 | 17 | 267 | 13.71 | 36.86 | 37.20 | 214023 | 6329 | 2.96 | Europe |
| 79 | India | 1480073 | 33408 | 951166 | 495499 | 44457 | 637 | 33598 | 2.26 | 64.26 | 3.51 | 1155338 | 324735 | 28.11 | South-East Asia |
| 85 | Italy | 246286 | 35112 | 198593 | 12581 | 168 | 5 | 147 | 14.26 | 80.64 | 17.68 | 244624 | 1662 | 0.68 | Europe |
| 111 | Mexico | 395489 | 44022 | 303810 | 47657 | 4973 | 342 | 8588 | 11.13 | 76.82 | 14.49 | 349396 | 46093 | 13.19 | Americas |
| 157 | Spain | 272421 | 28432 | 150376 | 93613 | 0 | 0 | 0 | 10.44 | 55.20 | 18.91 | 264836 | 7585 | 2.86 | Europe |
| 173 | US | 4290259 | 148011 | 1325804 | 2816444 | 56336 | 1076 | 27941 | 3.45 | 30.90 | 11.16 | 3834677 | 455582 | 11.88 | Americas |
| 177 | United Kingdom | 301708 | 45844 | 1437 | 254427 | 688 | 7 | 3 | 15.19 | 0.48 | 3190.26 | 296944 | 4764 | 1.60 | Europe |
From the summary statistics, the mean death toll per country is 3,497 while the median is 108. This indicates that the distribution of total Deaths is right-skewed, which is consistent with the histogram. Death tolls range from 0 to 148,011, which indicates a large variance in deaths between countries. The histogram shows that most countries kept their death tolls under 1,000. The large difference between the mean and the median is due to a few outliers. For example, the US, Brazil, Mexico and the UK have death tolls over 40,000, and the US accounts for the highest number of deaths in the world.
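To make the effect of the outliers concrete, the sketch below computes the share of all deaths that falls in the eight countries listed in the table above. It uses only numbers already shown (the eight death tolls, the count and the mean from describe()); the variable names are chosen for this illustration.

```python
# Death tolls of the eight countries listed above (each with at least 20,000 deaths)
top_deaths = {
    "Brazil": 87618, "France": 30212, "India": 33408, "Italy": 35112,
    "Mexico": 44022, "Spain": 28432, "US": 148011, "United Kingdom": 45844,
}
n_countries = 187
mean_deaths = 3497.518717   # from covid['Deaths'].describe()

# The worldwide total can be recovered as mean x count
total_deaths = round(mean_deaths * n_countries)
top_share = sum(top_deaths.values()) / total_deaths

print(f"{len(top_deaths)} of {n_countries} countries hold {top_share:.0%} of all deaths")
```

With the dataframe loaded, the same share could be obtained directly from covid['Deaths'] by summing the rows selected by covid['Deaths'] >= 20000 and dividing by the column total.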
covid['Confirmed last week'].describe(
).apply(lambda x: format(x, 'f'))
count        187.000000
mean       78682.475936
std       338273.676567
min           10.000000
25%         1051.500000
50%         5020.000000
75%        37080.500000
max      3834677.000000
Name: Confirmed last week, dtype: object
clw=covid['Confirmed last week'].copy()
fig, ax = plt.subplots(figsize=(10,10))
clw.plot(
kind="hist", y="Confirmed last week", color='#1b48fc',
bins=100, legend=False, density=False, ax=ax #density=True--percentage, density=False--count
).get_xaxis().get_major_formatter().set_scientific(False)
ax.set_facecolor((0.96, 0.96, 0.96))
fig.set_facecolor((0.96, 0.96, 0.96))
ax.spines['right'].set_visible(False)
ax.spines['top'].set_visible(False)
ax.set_ylabel('Count')
ax.set_xlabel('Confirmed last week cases')
ax.set_title("The number of total COVID-19 confirmed cases last week by countries until July 20, 2020")
covid.loc[covid['Confirmed last week']>=500000]
| | Country/Region | Confirmed | Deaths | Recovered | Active | New cases | New deaths | New recovered | Deaths / 100 Cases | Recovered / 100 Cases | Deaths / 100 Recovered | Confirmed last week | 1 week change | 1 week % increase | WHO Region |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 23 | Brazil | 2442375 | 87618 | 1846641 | 508116 | 23284 | 614 | 33728 | 3.59 | 75.61 | 4.74 | 2118646 | 323729 | 15.28 | Americas |
| 79 | India | 1480073 | 33408 | 951166 | 495499 | 44457 | 637 | 33598 | 2.26 | 64.26 | 3.51 | 1155338 | 324735 | 28.11 | South-East Asia |
| 138 | Russia | 816680 | 13334 | 602249 | 201097 | 5607 | 85 | 3077 | 1.63 | 73.74 | 2.21 | 776212 | 40468 | 5.21 | Europe |
| 173 | US | 4290259 | 148011 | 1325804 | 2816444 | 56336 | 1076 | 27941 | 3.45 | 30.90 | 11.16 | 3834677 | 455582 | 11.88 | Americas |
From the summary statistics, the mean of Confirmed last week per country is 78,682 while the median is 5,020. This indicates that the distribution of Confirmed last week is right-skewed as well, which is consistent with the histogram. Confirmed cases range from 10 to 3,834,677, with a large standard deviation of about 338,274 between countries. From the histogram, most of the countries kept confirmed cases under 10,000. The large difference between the mean and the median of confirmed cases is again due to a few outliers. For example, the US, Brazil, India and Russia had over 500,000 confirmed cases by 20th Jul. 2020, and the US also accounts for the highest number of confirmed cases in the world.
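covid.select_dtypes('number').corr() # numeric columns only; recent pandas versions raise an error on text columns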
| | Confirmed | Deaths | Recovered | Active | New cases | New deaths | New recovered | Deaths / 100 Cases | Recovered / 100 Cases | Deaths / 100 Recovered | Confirmed last week | 1 week change | 1 week % increase |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Confirmed | 1.000000 | 0.934698 | 0.906377 | 0.927018 | 0.909720 | 0.871683 | 0.859252 | 0.063550 | -0.064815 | 0.025175 | 0.999127 | 0.954710 | -0.010161 |
| Deaths | 0.934698 | 1.000000 | 0.832098 | 0.871586 | 0.806975 | 0.814161 | 0.765114 | 0.251565 | -0.114529 | 0.169006 | 0.939082 | 0.855330 | -0.034708 |
| Recovered | 0.906377 | 0.832098 | 1.000000 | 0.682103 | 0.818942 | 0.820338 | 0.919203 | 0.048438 | 0.026610 | -0.027277 | 0.899312 | 0.910013 | -0.013697 |
| Active | 0.927018 | 0.871586 | 0.682103 | 1.000000 | 0.851190 | 0.781123 | 0.673887 | 0.054380 | -0.132618 | 0.058386 | 0.931459 | 0.847642 | -0.003752 |
| New cases | 0.909720 | 0.806975 | 0.818942 | 0.851190 | 1.000000 | 0.935947 | 0.914765 | 0.020104 | -0.078666 | -0.011637 | 0.896084 | 0.959993 | 0.030791 |
| New deaths | 0.871683 | 0.814161 | 0.820338 | 0.781123 | 0.935947 | 1.000000 | 0.889234 | 0.060399 | -0.062792 | -0.020750 | 0.862118 | 0.894915 | 0.025293 |
| New recovered | 0.859252 | 0.765114 | 0.919203 | 0.673887 | 0.914765 | 0.889234 | 1.000000 | 0.017090 | -0.024293 | -0.023340 | 0.839692 | 0.954321 | 0.032662 |
| Deaths / 100 Cases | 0.063550 | 0.251565 | 0.048438 | 0.054380 | 0.020104 | 0.060399 | 0.017090 | 1.000000 | -0.168920 | 0.334594 | 0.069894 | 0.015095 | -0.134534 |
| Recovered / 100 Cases | -0.064815 | -0.114529 | 0.026610 | -0.132618 | -0.078666 | -0.062792 | -0.024293 | -0.168920 | 1.000000 | -0.295381 | -0.064600 | -0.063013 | -0.394254 |
| Deaths / 100 Recovered | 0.025175 | 0.169006 | -0.027277 | 0.058386 | -0.011637 | -0.020750 | -0.023340 | 0.334594 | -0.295381 | 1.000000 | 0.030460 | -0.013763 | -0.049083 |
| Confirmed last week | 0.999127 | 0.939082 | 0.899312 | 0.931459 | 0.896084 | 0.862118 | 0.839692 | 0.069894 | -0.064600 | 0.030460 | 1.000000 | 0.941448 | -0.015247 |
| 1 week change | 0.954710 | 0.855330 | 0.910013 | 0.847642 | 0.959993 | 0.894915 | 0.954321 | 0.015095 | -0.063013 | -0.013763 | 0.941448 | 1.000000 | 0.026594 |
| 1 week % increase | -0.010161 | -0.034708 | -0.013697 | -0.003752 | 0.030791 | 0.025293 | 0.032662 | -0.134534 | -0.394254 | -0.049083 | -0.015247 | 0.026594 | 1.000000 |
Examining the correlation table for the numerical variables in covid, we can see that cumulative Deaths is positively correlated with most of the variables; the exceptions are slight negative correlations with Recovered / 100 Cases and 1 week % increase. Deaths is strongly correlated with Confirmed, Confirmed last week, Active, Recovered, New deaths, New cases and 1 week change, and weakly correlated with the death rates (Deaths / 100 Cases and Deaths / 100 Recovered).
This section focuses on the relationship between the categorical independent variable WHO Region and the dependent variable Deaths. Using .groupby() and .describe(), we can see the summary statistics of Deaths in each region and draw a boxplot for each.
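The reading of the table above can be sketched by ranking the Deaths row. The values below are copied from that row; the 0.7 cut-off for "strong" is an arbitrary assumption made for this sketch, not a threshold used in the report.

```python
import pandas as pd

# The Deaths row of the correlation table above
deaths_corr = pd.Series({
    "Confirmed": 0.934698, "Recovered": 0.832098, "Active": 0.871586,
    "New cases": 0.806975, "New deaths": 0.814161, "New recovered": 0.765114,
    "Deaths / 100 Cases": 0.251565, "Recovered / 100 Cases": -0.114529,
    "Deaths / 100 Recovered": 0.169006, "Confirmed last week": 0.939082,
    "1 week change": 0.855330, "1 week % increase": -0.034708,
})

ranked = deaths_corr.sort_values(ascending=False)
strong = ranked[ranked.abs() >= 0.7]   # 0.7 is an arbitrary cut-off for this sketch

print(ranked.round(2))
print(list(strong.index))
```

On the full dataframe, the same ranking could be obtained by selecting the Deaths column of the correlation matrix, dropping the Deaths entry itself, and sorting.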
covid.groupby('WHO Region')['Deaths'].describe()
| WHO Region | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| Africa | 48.0 | 254.645833 | 1025.735615 | 0.0 | 11.75 | 47.5 | 105.25 | 7067.0 |
| Americas | 35.0 | 9792.342857 | 29104.886288 | 0.0 | 9.00 | 115.0 | 2853.00 | 148011.0 |
| Eastern Mediterranean | 22.0 | 1742.681818 | 3603.198597 | 11.0 | 67.50 | 330.5 | 1131.75 | 15912.0 |
| Europe | 56.0 | 3770.428571 | 9289.098725 | 0.0 | 66.75 | 398.0 | 1686.75 | 45844.0 |
| South-East Asia | 10.0 | 4134.900000 | 10420.727224 | 0.0 | 7.25 | 31.5 | 2238.25 | 33408.0 |
| Western Pacific | 16.0 | 515.562500 | 1220.355848 | 0.0 | 0.00 | 14.5 | 200.25 | 4656.0 |
covid.boxplot(by ='WHO Region', column =['Deaths'],
figsize=(12,8),
vert=False).set_xlabel('Deaths cases')
Text(0.5, 0, 'Deaths cases')
Grouping the countries by WHO Region, Europe has the largest number of countries (56) while South-East Asia has the smallest (10).
Grouping by WHO Region adds location effects to the analysis of deaths. The summary statistics table indicates that the mean death toll varies between regions, from about 9,792 in the Americas to about 255 in Africa. In the Americas, the median death toll per country is only 115, ranked 3rd among the 6 regions, while its maximum, the US with 148,011 deaths, is the highest among all 6 regions. From the boxplot, the Deaths distributions of all 6 regions are right-skewed. The spreads of Deaths differ between regions: the distributions in South-East Asia and the Americas are more variable than those in other regions. The Americas has the farthest outlier among all 6 regions, which is the total deaths of the US.
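One rough way to check the right skew numerically is the ratio of mean to median in each region: a ratio well above 1 means a few large countries pull the mean up. The sketch below uses the mean and 50% columns of the groupby table above; the ratio is an informal indicator chosen for this illustration, not a statistic computed in the report.

```python
# (mean, median) of Deaths per WHO Region, copied from the groupby table above
region_stats = {
    "Africa": (254.645833, 47.5),
    "Americas": (9792.342857, 115.0),
    "Eastern Mediterranean": (1742.681818, 330.5),
    "Europe": (3770.428571, 398.0),
    "South-East Asia": (4134.900000, 31.5),
    "Western Pacific": (515.562500, 14.5),
}

ratios = {region: mean / median for region, (mean, median) in region_stats.items()}

# Print the regions from most to least skewed by this measure
for region, ratio in sorted(ratios.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{region:<22} mean/median = {ratio:6.1f}")
```

Every ratio is above 1, and South-East Asia and the Americas have the largest ratios, in line with the boxplot.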
Deaths and Confirmed last week grouped by WHO Region
xy_vars=covid[['Deaths','Confirmed last week','WHO Region']].copy()
xy_vars.groupby('WHO Region').corr()
| WHO Region | | Deaths | Confirmed last week |
|---|---|---|---|
| Africa | Deaths | 1.000000 | 0.992414 |
| | Confirmed last week | 0.992414 | 1.000000 |
| Americas | Deaths | 1.000000 | 0.982917 |
| | Confirmed last week | 0.982917 | 1.000000 |
| Eastern Mediterranean | Deaths | 1.000000 | 0.774768 |
| | Confirmed last week | 0.774768 | 1.000000 |
| Europe | Deaths | 1.000000 | 0.638749 |
| | Confirmed last week | 0.638749 | 1.000000 |
| South-East Asia | Deaths | 1.000000 | 0.992593 |
| | Confirmed last week | 0.992593 | 1.000000 |
| Western Pacific | Deaths | 1.000000 | 0.865231 |
| | Confirmed last week | 0.865231 | 1.000000 |
After splitting the dataset into the 6 WHO Regions, the relationship between Deaths and Confirmed last week is still positive in each group. The relationship is stronger in the Africa, Americas and South-East Asia regions, where the correlations (0.99, 0.98 and 0.99) are above the correlation without grouping (0.94). This suggests that countries in these three regions with higher Confirmed last week totals are more likely to have higher total Deaths one week later.
Other variables in the covid dataset that are strongly correlated with total Deaths should be considered for further analysis. For example, total Confirmed cases has a correlation of 0.93 with total Deaths.
I intend to import country-level characteristics such as total population, land size, the population aged over 65 and other information from a credible source such as the World Bank to help explain the relationship of Deaths with Confirmed last week and other factors. With the total population of each country, I can take the next step and examine the relationship between population and Deaths. The aged population can also be a factor that influences Deaths in each country. The location (latitude and longitude) of each country is also valuable information for visualizing the data geographically.
CDC. (2020a, February 11). COVID-19 and Your Health. Centers for Disease Control and Prevention. https://www.cdc.gov/coronavirus/2019-ncov/need-extra-precautions/index.html
CDC. (2020b, October 28). How Coronavirus Spreads. Centers for Disease Control and Prevention. https://www.cdc.gov/coronavirus/2019-ncov/prevent-getting-sick/how-covid-spreads.html
Wang, F., Qu, M., Zhou, X., Zhao, K., Lai, C., Tang, Q., Xian, W., Chen, R., Li, X., Li, Z., He, Q., & Liu, L. (2020, July 3). The timeline and risk factors of clinical progression of COVID-19 in Shenzhen, China. Journal of Translational Medicine. https://translational-medicine.biomedcentral.com/articles/10.1186/s12967-020-02423-8
WHO. (2020, December 23). A year without precedent: WHO’s COVID-19 response. World Health Organization. https://www.who.int/news-room/spotlight/a-year-without-precedent-who-s-covid-19-response
WHO. (2021, February 3). WHO Coronavirus Disease (COVID-19) Dashboard. World Health Organization. https://covid19.who.int
Continuing my analysis from Project 1, Project 2 aims to visualize the relationship between Deaths and Confirmed last week for the spread of COVID-19 worldwide. The first part of Project 2 analyzes the correlation between the dependent variable Deaths and the independent variable Confirmed last week, and introduces WHO Region and pop_cut to divide the data into subgroups. Within the subgroups, I fit regression lines to see whether their correlations differ. The second part of Project 2 maps the independent and dependent variables: Confirmed last week, total population 2019 and Deaths. The interactive maps present the comparison and findings in an intuitive way.
We expect to see a positive relationship between cumulative deaths and confirmed cases: the more confirmed cases a country has, the more likely it is to have a high number of deaths. Within each WHO Region, deaths and confirmed cases are expected to be positively correlated as well. A country in the high population division is more likely to have both more confirmed cases and more deaths.
Based on the current covid dataset, I want to visualize the relationship between cumulative deaths and cases, which helps to estimate the ratio of deaths to confirmed cases in each country. I am also interested in whether the total population of a country has an effect on this relationship.
First, I imported and cleaned the 2019 population of each country, total_pop, from the World Bank data. Some problems arose when merging total_pop and covid; for example, some country names carry extra strings or use different abbreviations. To solve this, I used the package pycountry_convert to create a country_code column in the covid dataset, and merged total_pop with covid on the condition that either the country names or the country codes match. I then classified the population of each country into a categorical variable, pop_cut, using approximately the quartiles of the total_pop distribution: above 30 million is high, between 9.5 million and 30 million is higher middle, between 2.3 million and 9.5 million is lower middle, and the rest is low. The merged dataset is stored in covid2 and contains 184 observations.
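The "match on either the name or the code" merge described above can be illustrated on two toy frames. The populations and death counts below are placeholder values (1, 2, 3 and 10, 20, 30), not real data; only the column names follow the notebook.

```python
import pandas as pd

# Toy stand-ins for total_pop and covid; all numbers are placeholders
pop = pd.DataFrame({
    "Country Name": ["Korea, Rep.", "Canada", "Laos"],
    "Country Code": ["KOR", "CAN", "LAO"],
    "total population 2019": [1, 2, 3],
})
cases = pd.DataFrame({
    "Country/Region": ["South Korea", "Canada", "Laos"],
    "country_code": ["KOR", "CAN", "Unknown"],
    "Deaths": [10, 20, 30],
})

# South Korea matches only by code, Laos only by name, Canada by both
by_code = pd.merge(pop, cases, how="inner", left_on="Country Code", right_on="country_code")
by_name = pd.merge(pop, cases, how="inner", left_on="Country Name", right_on="Country/Region")

# Stack both results; a row that matched on both keys appears twice and is dropped once
merged = pd.concat([by_code, by_name]).drop_duplicates()
print(merged[["Country Name", "Country/Region", "Deaths"]])
```

Because both merges are inner joins, a country that matches on neither key is silently dropped, which is consistent with the dataset shrinking from 187 to 184 rows after the merge.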
import pycountry_convert as coco
#population data
total_pop=pd.read_csv("/users/lucygu/downloads/total_pop.csv")
total_pop=total_pop[['Country Name','Country Code','2019']]
total_pop=total_pop.rename(columns={"2019": "total population 2019"})
total_pop['Country Name'] = total_pop['Country Name'].replace('Korea, Rep.', 'South Korea')
total_pop['Country Name'] = total_pop['Country Name'].replace('Yemen, Rep.' , 'Yemen')
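# Caution: the World Bank file also has its own 'Guinea' row, so this rename leaves two rows named 'Guinea'
total_pop['Country Name'] = total_pop['Country Name'].replace('Equatorial Guinea', 'Guinea')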
total_pop['Country Name'] = total_pop['Country Name'].replace('Bahamas, The', 'Bahamas')
total_pop['Country Name'] = total_pop['Country Name'].replace('Congo, Rep.', 'Congo (Brazzaville)')
total_pop['Country Name'] = total_pop['Country Name'].replace('Congo, Dem. Rep.', 'Congo (Kinshasa)')
total_pop['Country Name'] = total_pop['Country Name'].replace('Egypt, Arab Rep.', 'Egypt')
total_pop['Country Name'] = total_pop['Country Name'].replace('Gambia, The', 'Gambia')
total_pop['Country Name'] = total_pop['Country Name'].replace('St. Lucia', 'Saint Lucia')
total_pop['Country Name'] = total_pop['Country Name'].replace('St. Vincent and the Grenadines', 'Saint Vincent and the Grenadines')
total_pop['Country Name'] = total_pop['Country Name'].replace('Lao PDR', 'Laos')
total_pop['Country Name'] = total_pop['Country Name'].replace('Kyrgyz Republic', 'Kyrgyzstan')
total_pop['Country Name'] = total_pop['Country Name'].replace('Slovak Republic', 'Slovakia')
# Only Eritrea lacks a 2019 population, so I manually fill in its 2011 population.
total_pop['total population 2019'] = total_pop['total population 2019'].fillna(3213972)
#add ISO-3 country code
covid['Country/Region'] = covid['Country/Region'].replace('US' , 'United States')
covid['Country/Region'] = covid['Country/Region'].replace('Taiwan*' , 'Taiwan')
covid['Country/Region'] = covid['Country/Region'].replace('Congo (Brazzaville)','Congo')
covid['Country/Region'] = covid['Country/Region'].replace('Congo (Kinshasa)','Democratic Republic of the Congo')
covid['Country/Region'] = covid['Country/Region'].replace("Cote d'Ivoire","Côte d'Ivoire")
def get_code(col):
    # Look up the ISO alpha-3 code; fall back to 'Unknown' when the name is not recognized
    try:
        cn_a3_code = coco.country_name_to_country_alpha3(col)
    except Exception:
        cn_a3_code = 'Unknown'
    return cn_a3_code
covid["country_code"] = covid["Country/Region"].apply(get_code)
covid["Country/Region"] = covid["Country/Region"].str.strip()#strip space at the beginning and the end
#merge covid data and population
#merge based on one or another condition
d_1 = pd.merge(total_pop,covid, how='inner', left_on = 'Country Code', right_on="country_code")
d_2 = pd.merge(total_pop,covid, how='inner', left_on="Country Name", right_on="Country/Region")
covid2 = pd.concat([d_1,d_2]).drop_duplicates()
covid2['total population 2019'] = covid2['total population 2019'].fillna(0)
covid2.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 184 entries, 0 to 168
Data columns (total 19 columns):
 #   Column                  Non-Null Count  Dtype
---  ------                  --------------  -----
 0   Country Name            184 non-null    object
 1   Country Code            184 non-null    object
 2   total population 2019   184 non-null    float64
 3   Country/Region          184 non-null    object
 4   Confirmed               184 non-null    int64
 5   Deaths                  184 non-null    int64
 6   Recovered               184 non-null    int64
 7   Active                  184 non-null    int64
 8   New cases               184 non-null    int64
 9   New deaths              184 non-null    int64
 10  New recovered           184 non-null    int64
 11  Deaths / 100 Cases      184 non-null    float64
 12  Recovered / 100 Cases   184 non-null    float64
 13  Deaths / 100 Recovered  179 non-null    float64
 14  Confirmed last week     184 non-null    int64
 15  1 week change           184 non-null    int64
 16  1 week % increase       184 non-null    float64
 17  WHO Region              184 non-null    object
 18  country_code            184 non-null    object
dtypes: float64(5), int64(9), object(5)
memory usage: 28.8+ KB
covid2['total population 2019'].describe()
count    1.840000e+02
mean     4.103578e+07
std      1.487455e+08
min      3.386000e+04
25%      2.336704e+06
50%      9.393937e+06
75%      2.946295e+07
max      1.397715e+09
Name: total population 2019, dtype: float64
def func(x):
    if 30000000 < x:
        return 'high'
    elif 9500000 < x <= 30000000:
        return 'higher middle'
    elif 2300000 < x <= 9500000:
        return 'lower middle'
    return 'low'
covid2['pop_cut'] = covid2['total population 2019'].apply(func)
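As an alternative sketch, the same four divisions can be produced with pd.cut, which by default uses right-closed intervals and therefore matches the inequalities in func. The sample populations below are boundary values chosen for this check, not rows from covid2, and func is repeated here only so that the block runs on its own.

```python
import numpy as np
import pandas as pd

def func(x):  # same thresholds as the notebook's function
    if 30000000 < x:
        return 'high'
    elif 9500000 < x <= 30000000:
        return 'higher middle'
    elif 2300000 < x <= 9500000:
        return 'lower middle'
    return 'low'

# Values on and just above each threshold, chosen to test the boundaries
pops = pd.Series([33860, 2300000, 2300001, 9500000, 9500001,
                  30000000, 30000001, 1397715000])

bins = [-np.inf, 2300000, 9500000, 30000000, np.inf]
labels = ['low', 'lower middle', 'higher middle', 'high']
pop_cut = pd.cut(pops, bins=bins, labels=labels)   # intervals are (left, right]

print(pd.DataFrame({'population': pops, 'pop_cut': pop_cut}))
```

One design difference is that pd.cut returns an ordered categorical, so the divisions sort from low to high rather than alphabetically, which can help when ordering plot legends.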
Deaths and Confirmed last week by WHO Region
I chose a scatter plot to see the relationship between Deaths and Confirmed last week by WHO Region. The pandemic spread across countries at a fast rate, and cumulative cases grew roughly exponentially, so I put both Deaths and Confirmed last week on a log scale to handle the skewness caused by the very large values of Confirmed last week in some countries. The log-log plot also shows whether there is a power-law relationship between the two variables.
import seaborn as sns
import math
fig = plt.figure(figsize=(10,10))
ax = plt.gca()
splot=sns.scatterplot(data=covid2, x="Confirmed last week", y="Deaths",hue="WHO Region", palette="deep")
ax.set_yscale('log')
ax.set_xscale('log')
ymin=math.log10(6)
ax.set(ylim=(ymin, 270000))
# Put the legend out of the figure
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)
# Reference lines: deaths as a fixed share of confirmed cases
x = np.arange(100,1000000)
for rate, label in [(0.01, "1%"), (0.005, "0.5%"), (0.0025, "0.25%"), (0.05, "5%"),
                    (0.1, "10%"), (0.2, "20%"), (0.02, "2%")]:
    y = rate*x
    plt.plot(x,y,":",linewidth =2,color = "gray")
    plt.annotate(label,(x[-2],y[-1]),xycoords="data",fontsize=14,alpha = 0.5)
plt.legend(title='WHO Region',bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)
plt.title('The log-log plot for Cumulative deaths cases and confirmed cases till last week, July 27, 2020, by WHO region')
Text(0.5, 1.0, 'The log-log plot for Cumulative deaths cases and confirmed cases till last week, July 27, 2020, by WHO region')
The log-log plot shows that the data fall roughly along a straight line, which suggests a power-law relationship between Confirmed last week and Deaths. The relationship is positive: a country with a higher death toll usually has more confirmed cases, and a country with more confirmed cases is more likely to have more deaths.
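The link between a straight line on log-log axes and a power law can be sketched with synthetic data. If deaths = a * cases**b, then log10(deaths) = log10(a) + b * log10(cases), so a linear fit in log space recovers b as the slope and log10(a) as the intercept. The values of a and b below are assumptions for illustration, not estimates from the dataset.

```python
import numpy as np

# Synthetic data following an exact power law: deaths = a * cases**b
a, b = 0.02, 1.0   # assumed values for illustration only
cases = np.array([1e2, 1e3, 1e4, 1e5, 1e6])
deaths = a * cases ** b

# Fit a straight line in log-log space: slope estimates b, intercept estimates log10(a)
slope, intercept = np.polyfit(np.log10(cases), np.log10(deaths), 1)

print(round(slope, 3), round(10 ** intercept, 3))
```

A slope close to 1 would mean deaths grow in proportion to confirmed cases, in which case 10**intercept can be read as a constant death share.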
Moreover, I added several reference lines to show deaths as a share of confirmed cases as of July 27, 2020. In most countries the share of deaths is below 5%. The reference lines indicate that in the Americas and Europe the mortality risks for most countries cluster in the range of 2% to 10%. In the Eastern Mediterranean the range is wider, from 0.25% to 20%. In Africa, the risks are noticeably below 5%, and the numbers of deaths and confirmed cases are lower. For further analysis, I divide the data into subplots by WHO Region.
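Each reference line is simply Deaths = rate x Confirmed last week, so a country's position relative to the lines follows from its own ratio. As a sketch, the block below computes that ratio for the eight countries with at least 20,000 deaths, using the Deaths and Confirmed last week values from the Project 1 table; the variable names are chosen for this illustration.

```python
# (Deaths, Confirmed last week) for the eight countries with at least 20,000 deaths
rows = {
    "Brazil": (87618, 2118646), "France": (30212, 214023),
    "India": (33408, 1155338), "Italy": (35112, 244624),
    "Mexico": (44022, 349396), "Spain": (28432, 264836),
    "US": (148011, 3834677), "United Kingdom": (45844, 296944),
}
reference_rates = [0.0025, 0.005, 0.01, 0.02, 0.05, 0.1, 0.2]   # the dotted lines in the plot

shares = {country: deaths / confirmed for country, (deaths, confirmed) in rows.items()}

for country, share in shares.items():
    # Highest reference line that the country still sits on or above
    below = max(rate for rate in reference_rates if rate <= share)
    print(f"{country:<15} {share:6.2%}  (on or above the {below:.2%} line)")
```

These eight countries are the largest outliers by death toll, so they are not representative of the typical country in each region.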
Deaths and Confirmed last week by WHO Region
g = sns.lmplot(x="Confirmed last week", y="Deaths", col="WHO Region",
               data=covid2, col_wrap=3, y_jitter=.03, sharey=False)
g.set(xscale = 'log')
g.fig.subplots_adjust(top=0.9) # leave room at the top for the figure title
g.fig.suptitle('Cumulative deaths cases and confirmed cases till last week, July 27, 2020, by WHO region')
Text(0.5, 0.98, 'Cumulative deaths cases and confirmed cases till last week, July 27, 2020, by WHO region')
The faceted regression plots, with a log-scaled x-axis, show a positive correlation between cumulative confirmed cases and deaths in all 6 regions. However, the y-axis range differs between subplots, which suggests that mortality risks vary between regions. The Americas subplot has the highest number of deaths on its y-axis, followed by Europe. This suggests that someone who tested positive in the week of July 20th, 2020 faced the highest chance of death in the following week in the Americas region, followed by the Europe region. However, the graph also indicates that the Africa and Asia regions have lower risks of death, even though more than 60% of the world population lives in those two continents. For further analysis, I took the total population of each country into account in addition to WHO Region.
Deaths and Confirmed last week by WHO Region and population of a country, pop_cut
g = sns.lmplot(x="Confirmed last week", y="Deaths", col="WHO Region", hue='pop_cut', palette="Dark2_r",
               data=covid2, col_wrap=3, y_jitter=.03, sharey=False, fit_reg=False, legend=True)
# Use regplot to draw one regression line per region on top of the scatter points.
# The regions are listed in the same order as the facets in g.axes.
regions = ['Eastern Mediterranean', 'Africa', 'Europe',
           'Americas', 'Western Pacific', 'South-East Asia']
for ax, region in zip(g.axes, regions):
    sns.regplot(x="Confirmed last week", y="Deaths",
                data=covid2.loc[covid2['WHO Region'] == region],
                scatter=False, ax=ax).set(xlabel=None)
g.set(xscale = 'log')
g.fig.subplots_adjust(top=0.9) # leave room at the top of the figure for the title
g.fig.suptitle('Cumulative deaths cases and confirmed cases till last week, July 27, 2020, by WHO region')
g._legend.set_title('Population Division')
The colored legend of the subplots shows that countries classified as high population tend to have more deaths and confirmed cases in all 6 regions. In the Americas, the population distribution accords closely with deaths: the larger a country's population, the more COVID deaths it reports. In other regions, some countries with high population have fewer COVID deaths than countries with lower population. In Africa, the Eastern Mediterranean and South-East Asia, some countries with high middle population have fewer deaths than countries with low population. In Africa especially, some countries with high and high middle population report small numbers of confirmed cases and deaths. This might show a limitation of the reported cumulative deaths and confirmed cases, as the reported numbers depend on a country's testing ability and other complex factors. For example, people with pre-existing diseases are more vulnerable during the pandemic, but their deaths may not be counted as COVID deaths.
I first imported the packages I needed and the world map data from Natural Earth, which outlines the countries. I noticed that some country names in the covid dataset did not match the country names in the map data, so I cleaned the names and created a new column of country codes to prepare for merging the data with the map.
The worldcovid dataset keeps a row whenever either the country code or the country name matches. For the range of the color legend, I took the log of the variables I wanted to display, since the cumulative cases increased exponentially and population varies a lot between countries.
import geopandas as gpd
from shapely.geometry import Point
from pycountry_convert import country_name_to_country_alpha3
%matplotlib inline
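# Grab the low-resolution Natural Earth world file
# (gpd.datasets was removed in GeoPandas 1.0; with newer versions, download the
# file from Natural Earth and pass its local path to gpd.read_file instead)
world = gpd.read_file(gpd.datasets.get_path("naturalearth_lowres"))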
world["name"] = world["name"].str.title()
world["name"] = world["name"].str.strip()
covid2["Country/Region"] = covid2["Country/Region"].str.strip()  # strip leading and trailing spaces
# Merge covid data with the world map:
# keep a country if it matches on the ISO alpha-3 code or on the name
d_1 = pd.merge(world,covid2, how='inner', left_on = 'iso_a3', right_on="country_code")
d_2 = pd.merge(world,covid2, how='inner', left_on="name", right_on="Country/Region")
worldcovid = pd.concat([d_1,d_2]).drop_duplicates()
#worldcovid.info()
#check for NA rows
#na=worldcovid[worldcovid.isna().any(axis=1)] #0 rows
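The merge above keeps a country when either its ISO code or its name matches the map. A quick way to see which countries still fail both conditions is sketched below with plain Python sets. The function name and the sample rows are made up for illustration; in the notebook the same check would run on the `iso_a3`, `name`, `country_code` and `Country/Region` columns.

```python
def unmatched_countries(map_rows, covid_rows):
    """Return covid country names that match the map on neither code nor name.

    Both arguments are lists of (code, name) tuples.
    """
    map_codes = {code for code, _ in map_rows}
    # Normalise names the same way as the notebook: strip spaces, title case
    map_names = {name.strip().title() for _, name in map_rows}
    return [
        name for code, name in covid_rows
        if code not in map_codes and name.strip().title() not in map_names
    ]

# Illustrative rows only, not taken from the datasets
world_rows = [("USA", "United States Of America"), ("FRA", "France"), ("-99", "Kosovo")]
covid_rows = [("USA", "US"), ("FRA", "France"), (None, "Kosovo"), ("SGP", "Singapore")]
missing = unmatched_countries(world_rows, covid_rows)
```

Any name left in `missing` is a country that would be dropped from the map and may need a manual name or code fix.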
The interactive graphs below are provided in html format on Google Drive for submission. The folder [https://drive.google.com/drive/folders/1-jkQBXFr9dOV5RLfjMmz7rl6BwK0B4PP?usp=sharing] contains every interactive graph and an html version of the complete notebook. I also provide an individual link for each graph.
Link:[https://drive.google.com/file/d/1w62uMsLW5tyaT3ICaZIUldstVA_GoM3P/view?usp=sharing]
import plotly.express as px
data = worldcovid.copy()
data['Deaths_Log'] = np.log(worldcovid['Deaths']+1)
fig = px.choropleth(data,
locations='country_code',
color='Deaths_Log', # a column in the dataset
hover_name='Country/Region', # column to add to hover information
hover_data=["Confirmed last week", "Deaths",
'Active', 'Deaths / 100 Cases','WHO Region'],
color_continuous_scale=px.colors.sequential.amp)
fig.update_layout(title_text="Heat Map - Cumulative COVID death cases around the world as of July 27, 2020")
fig.update_coloraxes(colorbar_title="<b>Color</b><br>Deaths Cases<br>Log Scale")
fig.update_layout(margin={"r":0,"l":0,"b":0})
fig.show()
fig.write_html("/users/lucygu/downloads/deaths.html")
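The `+1` inside the log keeps countries with zero deaths on the color scale, since log(0) is undefined. The sketch below shows the transform and its inverse, which converts a colorbar value back to a raw count. It uses the standard-library `math.log1p` and `math.expm1` rather than the notebook's `np.log(x + 1)`, and the sample counts are illustrative.

```python
import math

def to_color_value(count):
    """log(count + 1): maps 0 to 0 and compresses large counts."""
    return math.log1p(count)

def to_count(color_value):
    """Inverse transform: recover the raw count from a colorbar value."""
    return math.expm1(color_value)

counts = [0, 9, 999, 150_000]  # illustrative death counts
color_values = [to_color_value(c) for c in counts]
recovered = [to_count(v) for v in color_values]
```

For example, a colorbar value of 10 corresponds to roughly 22,000 deaths, so each step of about 2.3 on the colorbar is a tenfold increase.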
In the interactive map of cumulative Deaths in log scale, the darkest reds match the countries with the highest numbers of Deaths in the table from Project 1. The United States and Brazil have the darkest color on the map, which means they had the highest cumulative death tolls as of July 27, 2020, followed by India, Mexico, France and others. Hovering over a country shows its detailed information, including Confirmed last week, Deaths, Deaths / 100 Cases, etc.
The map also shows where the countries of each WHO Region are located. The Americas have higher death tolls than any other region, as most of the continent is covered in dark red. Europe is also largely dark red, which indicates high death tolls for most of its countries. This suggests a location effect on the deaths of each country: if a nearby country has a high number of deaths, it also has a high number of confirmed cases, and deaths and confirmed cases in the country itself are more likely to increase as COVID spreads.
However, deaths could be influenced by other factors, such as population, the proportion of elderly people, the stringency of social distancing policy in different countries, etc.
Link:[https://drive.google.com/file/d/1RlMitkc_Z3P1iSe-5PVB20KIL7ghL02p/view?usp=sharing]
data = worldcovid.copy()
data['Confirmed_Log'] = np.log(worldcovid['Confirmed last week'])
fig = px.choropleth(data,
locations='country_code',
color='Confirmed_Log', # a column in the dataset
hover_name='Country/Region', # column to add to hover information
hover_data=["Confirmed last week", "Deaths",
'Active', 'Deaths / 100 Cases','WHO Region'],
color_continuous_scale=px.colors.sequential.amp)
fig.update_layout(title_text="Heat Map - Cumulative COVID confirmed cases around the world as of July 20, 2020")
fig.update_coloraxes(colorbar_title="<b>Color</b><br>Confirmed Cases last week<br>Log Scale")
fig.update_layout(margin={"r":0,"l":0,"b":0})
fig.write_html("/users/lucygu/downloads/confirms.html")
fig.show()
In the interactive map of cumulative Confirmed cases up to July 20, 2020 in log scale, the darkest colors match the countries with the highest numbers of Confirmed cases in the table from Project 1. The darkest red area shows that the United States has more cumulative confirmed cases than any other country. Hovering over a country shows its detailed information, including Confirmed last week, Deaths, Deaths / 100 Cases, etc.
The map also shows where the countries of each WHO Region are located. It suggests a location effect on the confirmed cases of each country: if a nearby country has a high number of confirmed cases, the country is more likely to have a similarly large number of cumulative confirmed cases as COVID spreads. Countries with large numbers of cumulative confirmed cases are mainly concentrated in the Americas, South-East Asia and Europe, which are covered in dark red.
However, confirmed cases could be influenced by other factors, such as testing ability, the stringency of social distancing policy in different countries, population, etc.
Link:[https://drive.google.com/file/d/1cwHqVc2ddZTL1K4bcf3oQ6q-8luvU9RO/view?usp=sharing]
data = worldcovid.copy()
data['pop_log'] = np.log(worldcovid['total population 2019']+1)
fig = px.choropleth(data,
locations='country_code',
color='pop_log', # a column in the dataset
hover_name='Country/Region', # column to add to hover information
hover_data=["total population 2019","Confirmed", "Deaths",
'Active', 'Deaths / 100 Cases','WHO Region'],
color_continuous_scale=px.colors.sequential.YlGnBu)
fig.update_layout(title_text="Heat Map - population division around the world as of July 20, 2020")
fig.update_coloraxes(colorbar_title="<b>Color</b><br>Population 2019<br>Log Scale")
fig.update_layout(margin={"r":0,"l":0,"b":0})
fig.show()
fig.write_html("/users/lucygu/downloads/pop_cut.html")
This map shows the 2019 population of each country in log scale. The darkest blue areas include China, India, the United States, Indonesia and Brazil, which are among the world's most populous countries. Comparing with the two maps above and with each country's hover data, the countries with the largest populations tend to have relatively large cumulative Deaths, Confirmed last week and Confirmed cases. For example, the United States, Brazil and India are among the countries with the largest cumulative reported Deaths and Confirmed last week cases. The countries in yellow-green or blue-green have smaller populations, and their cumulative cases are correspondingly smaller.
The previous visualization indicated that some countries in Africa with larger populations reported fewer COVID cases. This map makes that easy to see: for example, Nigeria, Egypt and the Democratic Republic of the Congo have larger populations than Canada or some European countries, but their cumulative cases are much smaller. However, compared within Africa, their reported cases are indeed larger than those of African countries with smaller populations. These exceptions may result from other factors that affect the cumulative cases, such as testing ability and the stringency of social distancing policy in different countries.
However, there are some exceptions to the argument that countries with larger populations have more cumulative cases. For example, China has the largest population in the world, yet its hover data show fewer cumulative cases than many countries with smaller populations. Canada is not among the 20 most populous countries shown in dark blue, but its cumulative Deaths and Confirmed last week cases exceed those of more populous countries such as Japan and China.
In Project 2, I took some further visualization steps to demonstrate the relationship between cumulative Deaths and Confirmed last week. The analysis shows that cumulative Confirmed last week is positively correlated with cumulative Deaths, and that the correlation varies across subgroups based on WHO Region and pop_cut. Within each WHO Region, countries with larger populations tend to have more reported cases. The location of a country on the map also affects its cumulative cases.
Generally, cumulative Deaths are expected to be higher if a country has high cumulative Confirmed last week cases. The larger a country's population, the more cumulative Deaths tend to be reported. However, there are exceptions: some countries in Europe with relatively small populations reported more Deaths than countries with larger populations, and countries in the Africa region reported small numbers of Deaths even though the region has the second largest population.
The relationship between Deaths and Confirmed last week can be interpreted as the likelihood of dying from COVID. The risk varies between countries and WHO Regions. There are some limitations of the data that can affect the relationship between deaths and confirmed cases, and therefore the prediction of the mortality risk for each country. The reported cases rely on each country's testing criteria and capacity. Some infected people do not show any symptoms, and people who died with both pre-existing diseases and COVID may not be counted as COVID-19 deaths. Deaths and confirmed cases could also be affected by other factors, such as the stringency of policy.
For further analysis, merging in other country-level characteristics can help to explain the relationship and would complement the analysis of mortality risk and the control of the spread of COVID. Classifying countries by testing rate and policy stringency would be interesting and might reveal further findings.